Automated field position linking of indexed data to digital images

ABSTRACT

According to one embodiment, a method of linking a document image with indexed data is provided. The method may be performed by providing a document image, which is a digitized document having various information. Indexed data is also provided, which is a record that includes information extracted from the document image or a different document image. The process is further performed by identifying a feature of the indexed data and analyzing the document image to determine whether the feature is present within the information of the digitized document. The feature may be information or a characteristic defined by the information extracted from the document image or the different document image. If the feature is present within the information of the digitized document, a determination is made that the indexed data corresponds with the document image and the indexed data is linked with the document image.

CROSS REFERENCES TO RELATED APPLICATIONS

This application is a continuation of U.S. application Ser. No. 13/601,111 filed on Aug. 31, 2012, which is hereby expressly incorporated by reference in its entirety for all purposes as if fully set forth herein.

BACKGROUND OF THE INVENTION

The present invention relates generally to methods and systems for linking documents and more specifically to methods and systems for linking document images with corresponding indexed data.

Currently, data-image linking is mainly a manual process that normally involves the use of manual interactive software tools that enable keyed or indexed data sets to be viewed, inspected, and linked or synched up with source images. The process often involves a digitized image being presented to a user along with corresponding keyed data in a sequential manner, requiring the user to verify sequential matches and make manual adjustments where either the data or images are off. There are several problems with current processes. For example, linking or synching image sets to keyed data sets in this manner is a tedious and error-prone process.

It is true that in the simplest case where keyed or indexed data is produced from a given set of images, it should not be necessary to link the data to the appropriate images, at least as long as the indexing process is careful to maintain the proper association between the index data and the corresponding image from which that data was keyed. In practice, however, it often becomes advantageous and/or even necessary to link or sync-up indexed datasets to corresponding images.

In industries working with historical documents, which documents may span hundreds of years, the effort to produce, duplicate, preserve, print, digitize, and otherwise handle the historical documents has increased dramatically over time. Due to the efforts of numerous libraries, archives, and other organizations to preserve these documents from generation to generation, multiple copies of the documents usually exist. Furthermore, the documents often exist in multiple formats. The documents may have originally been handwritten on hand-drawn or machine-printed paper forms. The documents may then have been photographed or microfilmed/microfiched, and duplicated through any number of copies before being ultimately scanned or imaged (i.e., digitized) using a wide variety of modern digital imaging devices. Hence, there are typically many “sources” or copies of the documents or images for a given collection. The quality of the source image in terms of size, resolution, legibility, and the like can vary widely. Furthermore, the possibility of duplicate images, missing images, damaged images, and other such variations between image sets may lead to situations where even the count and sequence of images in these collections are inconsistent.

Meanwhile, multiple organizations continue to work to preserve and provide access to these collections. Therefore, in addition to multiple sources of documents or images, there are often multiple keyed or indexed “datasets” produced from various sets of images for a given collection. Consequently, there frequently exists a many-to-many relationship between “image-sets” and “datasets” for any given collection.

BRIEF SUMMARY OF THE INVENTION

Embodiments of the invention describe methods and systems for linking documents and datasets. According to one aspect, a method of linking a set of document images with an indexed dataset is provided. The method may include providing a set of document images, where the document images are digitized documents having information about individuals, places, or things. The method may also include providing an indexed dataset. The indexed dataset may include data records that each include information extracted or keyed from a document image of the set of document images or a similar set of document images. A first data record may be selected from the indexed dataset. A first feature may be identified from the first data record. The first feature may be defined by a set of the information extracted or keyed from one of the document images. Similarly, a second feature may be identified from the first data record, the second feature being defined by a subset of the information extracted or keyed from the one of the document images. A first document image may be selected from the set of document images. The first document image may be analyzed to determine that the first feature and second feature are present within the information of the first document image. Based on the presence of the first feature and the second feature within the information of the first document image, a determination may be made that the first data record corresponds with the first document image.

The method may additionally include selecting a second data record from the indexed dataset. A third and fourth feature may be identified from the second data record, where the third feature is defined by a set of the information extracted or keyed from a different one of the document images and the fourth feature is defined by a subset of the information extracted or keyed from the different one of the document images. A second document image may be selected from the set of document images and may be analyzed to determine that the third feature and fourth feature are present within the information of the second document image. Based on the presence of the third feature and fourth feature within the information of the second document image, a determination may be made that the second data record corresponds with the second document image. The method may additionally include linking at least a portion of the indexed dataset with the set of document images.

According to one embodiment, the document images may be arranged in sequential order and the data records may also be arranged in sequential order. In such embodiments, the method may additionally include: determining a sequential arrangement of the first and second document images, determining a sequential arrangement of the first and second data records, determining a sequential arrangement for each of the remaining document images and data records based on the respective determined sequential arrangements of the first and second document images and the first and second data records, and linking the data records with the document images based on the determined sequential arrangements of the data records and document images.
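As an illustration only, the sequential linking described above can be sketched in a few lines of Python. The sketch assumes that two image/record pairs have already been matched by feature analysis and that both sequences use zero-based positions; the function name `link_by_sequence` and the tuple layout are hypothetical and do not come from the disclosure.

```python
def link_by_sequence(anchor_a, anchor_b, num_images, num_records):
    """Propagate two matched (image_index, record_index) anchors to the rest.

    Hypothetical helper: if both anchors imply the same offset between the
    image sequence and the record sequence, every remaining image is linked
    to the record at that offset. Returns None when the offsets disagree,
    which suggests a missing or duplicate item between the anchors.
    """
    offset_a = anchor_a[1] - anchor_a[0]
    offset_b = anchor_b[1] - anchor_b[0]
    if offset_a != offset_b:
        return None

    links = {}
    for image_index in range(num_images):
        record_index = image_index + offset_a
        # Skip images whose implied record falls outside the dataset.
        if 0 <= record_index < num_records:
            links[image_index] = record_index
    return links


if __name__ == "__main__":
    # Images 2 and 10 were matched to records 3 and 11 (offset of one).
    print(link_by_sequence((2, 3), (10, 11), num_images=5, num_records=5))
```

Returning None rather than guessing keeps inconsistent sequences available for the missing-item and duplicate-item handling that the following embodiment describes.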

According to another embodiment, the method may additionally include determining that the indexed dataset is missing a data record or includes a duplicate data record, or determining that the set of document images is missing a document image or includes a duplicate document image, and adjusting the sequential order of the corresponding indexed dataset or set of document images based on the determination. According to another embodiment, the method may additionally include linking the set of document images with a second set of document images based on the step of linking the data records with the document images. The second set of document images may be the digitized documents from which the information for the indexed dataset was extracted or keyed. According to another embodiment, the method may additionally include linking the indexed dataset with a second indexed dataset based on the step of linking the data records with the document images. The second indexed dataset may include information extracted or keyed from the documents of the set of document images.

According to one embodiment, the set of document images may include genealogical documents and the indexed dataset may include genealogical data. Further, the set of document images may be provided from a first source and the indexed dataset may be provided from a second source that is different than the first source.

According to another aspect, a method of linking a document image with indexed data is provided. The method may include providing a document image, where the document image is a digitized document having information about an individual, place, or thing. The method may additionally include providing indexed data, the indexed data including information extracted or keyed from a different document image. A first feature may be identified from the indexed data and the document image may be analyzed to determine that the first feature is present within the information of the document image. The first feature may be defined by a set of the information extracted or keyed from the different document image. Based on the presence of the first feature within the information of the document image, a determination may be made that the indexed data corresponds with the document image. Subsequently, the indexed data may be linked with the document image.

According to some embodiments, the method may further include: identifying a second feature of the indexed data, the second feature being defined by a subset of the information extracted or keyed from the different document image; analyzing the document image to determine that the second feature is present within the information of the document image; and determining that the indexed data corresponds with the document image based on the presence of the first feature and the second feature within the information of the document image. The first and/or second identified feature may include a recognition of a number of record entries of the document image, a recognition of a sex of an individual or individuals identified on the document image, a handwriting recognition, a strikethrough recognition, a blank field recognition, a table detection, a field detection, word spotting, and the like. According to some embodiments, a number of detected recorded entries for the document image may be adjusted based on a detection or recognition of a strikethrough for an entry.

According to some embodiments, an approximate location on the document image may be known for the information within the indexed data and the method may further include: adjusting the approximate location for the information based on a detection of a blank field, row, or column of information on the document image. According to another embodiment, the method may additionally include: determining a confidence value based on the presence of the first feature within the information of the document image, the confidence value indicating a probability that the indexed data corresponds with the document image, and adjusting the confidence value based on the presence of the second feature within the information of the document image, the adjusted confidence value indicating an increased probability that the indexed data corresponds with the document image. In such embodiments, a third feature of the indexed data may be identified, the document image may be analyzed to determine that the third feature is present within the information of the document image, and the confidence value may be readjusted based on the presence of the third feature within the information of the document image. The third feature may be defined by an additional subset of the information extracted or keyed from the document image.
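The disclosure does not specify how the confidence value is computed or adjusted. Purely as a sketch, one plausible approach is an odds-based update in which each feature found on the document image multiplies the current odds by a likelihood ratio. The function name `update_confidence`, the starting value of 0.5, and the ratios below are all illustrative assumptions.

```python
def update_confidence(prior: float, likelihood_ratio: float) -> float:
    """Return an adjusted confidence after one feature is found on the image.

    Assumed model (not from the disclosure): the confidence is treated as a
    probability, converted to odds, scaled by a likelihood ratio greater than
    one for a supporting feature, and converted back to a probability.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    odds = prior / (1.0 - prior)
    posterior_odds = odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


if __name__ == "__main__":
    confidence = 0.5  # assumed neutral starting point
    # Hypothetical features and ratios: first, second, and third feature.
    for feature, ratio in [("entry_count", 3.0),
                           ("blank_pattern", 4.0),
                           ("sex_ratio", 2.0)]:
        confidence = update_confidence(confidence, ratio)
        print(f"{feature}: confidence is now {confidence:.3f}")
```

With ratios above one, each additional matching feature raises the confidence, mirroring the first, second, and third feature adjustments described above.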

According to some embodiments, the document image may be received from a first source and the indexed data may be received from a second source different than the first source. Further, the document image may be a digitized genealogical document and the indexed data may be genealogical data.

According to another aspect, a method of linking a first document image with a second document image is provided. The method may include providing a first document image, where the first document image is a digitized first document having information about an individual, place, or thing. The method may also include providing a second document image, where the second document image is a digitized second document having information about an individual, place, or thing. A first feature of the first document image may be identified and a second feature of the first document image may also be identified. The first feature may define or be defined by a set of the information of the first document, or may define an arrangement of the set of information of the first document. Similarly, the second feature may define or be defined by a subset of the information of the first document, or may define an arrangement of the subset of information of the first document. The second document image may be analyzed to determine that the first feature is present within the information of the second document and may be further analyzed to determine that the second feature is present within the information of the second document. Based on the presence of the first feature and the second feature within the information of the second document image, a determination may be made that the first document image and the second document image are digitized forms or versions of the same source document. The first document image may be subsequently linked with the second document image.

Further, the second document image may be linked with a second indexed data record, where the second indexed data record includes information extracted or keyed from the second document image. In such embodiments, the method may further include: linking the second indexed data record with the first document image based on the determination that the first document image and the second document image are digitized forms or versions of the same document, or based on the linking of the first document image with the second document image. Similarly, the first document image may be linked with a first indexed data record, where the first indexed data record includes information extracted or keyed from the first document image. In such embodiments, the method may further include: linking the first indexed data record with the second document image or the second indexed data record based on one or more of the following: the determination that the first document image and the second document image are digitized forms or versions of the same document, the linking of the first document image with the second document image, or the linking of the second indexed data record with the first document image.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is described in conjunction with the appended figures:

FIGS. 1A-D illustrate various instances where a document image or a set of document images may be linked with a data record or an indexed dataset according to an embodiment of the invention.

FIGS. 2A & B illustrate various document images according to an embodiment of the invention.

FIG. 3 illustrates an exemplary process flow diagram according to an embodiment of the invention.

FIG. 4 illustrates a general purpose computer system that may be used to implement the methods described herein according to an embodiment of the invention.

In the appended figures, similar components and/or features may have the same numerical reference label. Further, various components of the same type may be distinguished by following the reference label by a letter that distinguishes among the similar components and/or features. If only the first numerical reference label is used in the specification, the description is applicable to any one of the similar components and/or features having the same first numerical reference label irrespective of the letter suffix.

DETAILED DESCRIPTION OF THE INVENTION

The ensuing description provides exemplary embodiments only, and is not intended to limit the scope, applicability or configuration of the disclosure. Rather, the ensuing description of the exemplary embodiments will provide those skilled in the art with an enabling description for implementing one or more exemplary embodiments. It is understood that various changes may be made in the function and arrangement of elements without departing from the spirit and scope of the invention as set forth in the appended claims.

Also, it is noted that individual embodiments may be described as a process which is depicted as a flowchart, a flow diagram, a data flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be re-arranged. A process may be terminated when its operations are completed, but could have additional steps not discussed or included in a figure. Furthermore, not all operations in any particularly described process may occur in all embodiments. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When a process corresponds to a function, its termination corresponds to a return of the function to the calling function or the main function.

The embodiments described herein may use the terms document image(s), source image(s), digital document(s), and the like. These terms generally describe a digitized document or computer readable file that represents any type of document. For example, as described herein, historical documents, or other documents, are routinely digitized by scanning the document, photographing the document, and the like. These documents may include handwritten or hand-drawn documents, machine-printed paper forms, government forms or records (e.g., birth certificate, marriage certificate, census record, and the like), employment or education records, legal records (e.g., a will, contract, and the like), medical records, journals or diaries, and the like. These documents may have been photographed, microfilmed/microfiched, and/or duplicated through any number of processes before being ultimately scanned or imaged (digitized) using a wide variety of modern digital imaging devices. Hence, the term document image refers generally to any digitized image or document. For convenience, the disclosure herein will be mainly directed to document images of genealogical records; however, it should be realized that the disclosure is not limited to such documents.

The embodiments described herein also generally refer to a set or sets of document images, image sets, and the like. These terms generally refer to a plurality of document images, which may be included in a single document or file, or included in multiple separate documents or files. For example, each document image may be included as separate pages within a single document or file, or may be included as separate documents or files altogether. Hence, the set or sets of document images as described herein are not limited to a specific type unless stated otherwise.

The embodiments described herein also use the terms indexed data, indexed data record, data record, and the like. These terms generally describe computer readable files that include information that has been extracted or keyed from a document or document image. For example, as described herein, historical or other documents may be analyzed or reviewed and the information contained therein input, either manually or in an automated fashion, into one or more other documents. A common manner in which this is performed is to have one or more individuals manually enter the information into a separate document, which may or may not have predefined fields for such information. Common documents that are “keyed” in this manner include genealogical records, which may be “keyed” by employees, hobbyists, ancestors, and the like.

The embodiments described herein also generally refer to indexed datasets or sets of indexed data or data records. These terms generally refer to a plurality of indexed data records, which like the sets of document images, may be included in a single document or file, or included in multiple separate documents or files. For example, each indexed data record may be included as separate pages within a single document or file, or may be included as separate documents or files altogether. Hence, the indexed datasets are not limited to a specific type unless stated otherwise.

Embodiments described herein also use the term linking to signify a relationship between various documents, records, and the like. This term generally means that some association has been made between the various documents, records, and the like, such as to show a relationship or connection between these documents. For example, a common link that is described herein is a link between a document image and an indexed data record, or between two document images. The linkage may indicate that the indexed data record is connected with or has a relationship with the corresponding document image, such as due to including information obtained from the document image. Other linkages may also be made, such as between two document images to indicate that they are images of the same or similar documents, such as the same source document. Linkages may also be made between two indexed data records to indicate that they include information obtained from the same or similar document images, and the like.

As described herein, a common problem with digitizing documents, especially historical documents, is the possibility of multiple “source” document images, or in other words, multiple copies of the same or similar documents. As can be understood, the quality of these document images in terms of size, resolution, legibility, and the like, can vary widely. Similarly, the indexed data records or datasets that result from the multiple copies may also vary. For example, some documents may have a more complete record of information, or have higher resolution and legibility resulting in a better or slightly different indexed data record. In other instances, a poor quality document image may be linked with a corresponding and relatively high quality indexed data record, while a better quality document image is not linked with any indexed data record at all. Ideally, the highest quality document images may be linked with the highest quality indexed data record so that users may have access to both a high quality document image and a highly accurate indexed data record.

Another common problem is the amount of time and resources that are invested into linking various document images and indexed data records. For example, as described herein, such linkages are often performed manually where a user manually reviews a document image and a corresponding indexed data record and verifies that the image and record match. This process is often taxing on the user and expensive for a company.

Embodiments of the invention describe novel systems and methods that simplify, accelerate, and improve the accuracy of image-to-dataset linking. The systems and methods may also be used to automatically audit, inspect, or validate existing data-image linking associations and to flag possible inconsistencies for subsequent review and correction. The embodiments described herein may use new and/or existing image processing algorithms to perform such processes.

According to one embodiment, the methods and systems are directed toward simplifying, speeding up, and/or automating the task of linking document images with indexed data records. The embodiments may also be used to link the best data records with the highest quality document images. The embodiments described herein leverage image processing algorithms to analyze image information on a document image such as a number of records or fields populated on a page, a detection of blank fields on a page, an identification of strikethrough or crossed out fields and other identifying features, and the like. The embodiments may also utilize handwriting recognition algorithms to detect text from various fields on the page. The handwriting recognition algorithms may detect individual words and/or merely detect various handwriting styles. The algorithms may indicate when a high probability exists that the document image corresponds with or matches the data of the indexed data record, and the document image and indexed data record may be subsequently linked.

According to some embodiments, handwriting recognition may include recognizing rudimentary handwriting to detect text from various fields on the page and sync or link to the best probability matches on the keyed/indexed data. Stated differently, highly accurate handwriting recognition is not necessarily required. Rather, simple recognition in constrained domains, such as recognition of gender (M/F), age (0-120), relationship (father, mother, son, daughter, etc.), occupation, and the like, may be all that is needed for effective linking, syncing, and/or matching of indexed data to document images and for estimating the probability that a proposed association is correct.
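To illustrate why a constrained domain tolerates rudimentary recognition, the following Python sketch snaps noisy recognizer output to the closest value in a small vocabulary and then measures agreement with the indexed data. It uses `difflib.get_close_matches` from the Python standard library; the vocabulary, the 0.6 cutoff, and the function names are assumptions made for this example rather than details from the disclosure.

```python
from difflib import get_close_matches

# Assumed constrained vocabulary for a "relationship" column.
RELATIONSHIPS = ["father", "mother", "son", "daughter"]


def snap_to_vocabulary(raw, vocabulary, cutoff=0.6):
    """Map noisy recognizer output to the closest allowed value, or None."""
    matches = get_close_matches(raw.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def agreement(recognized, indexed, vocabulary):
    """Fraction of rows whose snapped value equals the indexed value."""
    if not indexed:
        return 0.0
    hits = sum(
        1
        for raw, expected in zip(recognized, indexed)
        if snap_to_vocabulary(raw, vocabulary) == expected
    )
    return hits / len(indexed)


if __name__ == "__main__":
    noisy = ["fathr", "mothcr", "son", "dauqhter"]  # imperfect recognition
    keyed = ["father", "mother", "son", "daughter"]  # indexed data record
    print(agreement(noisy, keyed, RELATIONSHIPS))
```

Because only a handful of values are possible, several misrecognized characters per field still resolve to the intended value, and the agreement fraction can serve as one input when estimating whether a proposed association is correct.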

According to some embodiments, the image processing analysis may be used in addition to, or alternatively as, an effective audit or validation of information extraction or keying results in order to flag likely mismatches. For example, mismatches could be detected where the number of data records in the indexed dataset differs from the number of document image records detected by image processing algorithms. In another embodiment, a difference in the arrangement of blank cells or strikethroughs could be detected, or a difference in the rudimentary handwriting on selected fields may be recognized.
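A minimal sketch of such an audit, assuming that image processing and the indexed data have each been reduced to a small per-page summary, might look like the following. The `PageSummary` structure, its field names, and the flag strings are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageSummary:
    """Hypothetical per-page summary, from image analysis or indexed data."""
    entry_count: int
    blank_cells: frozenset = frozenset()   # (row, column) positions
    struck_rows: frozenset = frozenset()   # row numbers with a strikethrough


def audit_link(image: PageSummary, indexed: PageSummary) -> list:
    """Return a list of flags describing likely mismatches for review."""
    flags = []
    if image.entry_count != indexed.entry_count:
        flags.append(
            f"entry count differs: image={image.entry_count}, "
            f"indexed={indexed.entry_count}"
        )
    if image.blank_cells != indexed.blank_cells:
        flags.append("blank cell arrangement differs")
    if image.struck_rows != indexed.struck_rows:
        flags.append("strikethrough arrangement differs")
    return flags


if __name__ == "__main__":
    from_image = PageSummary(30, frozenset({(4, 2)}), frozenset({7}))
    from_index = PageSummary(29, frozenset({(4, 2)}), frozenset())
    for flag in audit_link(from_image, from_index):
        print(flag)
```

An empty list means no inconsistency was detected under these checks; any returned flag marks the existing association for subsequent review rather than rejecting it outright.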

The embodiments described herein may also be used for various document search functions. For example, a query could be constructed to request a search of a document image collection to look for and obtain the highest quality image that matches a given set of data parameters. Conversely, an image based search could be constructed to search an indexed dataset for the best data to match the image. For example, a census could be scanned and the best resulting data record delivered to the user based on the scanned image.

As described herein, there are many instances where it becomes advantageous and/or even necessary to “link” or sync-up various indexed datasets to corresponding document images within various image collections. For example, a document image collection may have been initially indexed using a poor set of images and a better set of images may be later discovered. In this instance, it may be desirable to leverage the existing dataset, but “link” or associate each image from the “better” image set with the previously keyed data. According to another example, a keyed or indexed dataset may be licensed or acquired from a third party separate from the originally associated document images, possibly due to copyright restrictions or any number of other reasons. Again, it may be desirable and/or necessary to “link” the acquired dataset to existing images. According to yet another example, a better or more accurate indexed dataset may be found and may need to be linked to a better set of document images. Alternatively, it may be possible and desirable to link or sync the highest quality indexed datasets with the highest quality document images, or even link multiple document images and/or datasets to overcome transcription errors and the like. Stated differently, when multiple indexed datasets are linked to a given set of document images, the likelihood increases that at least one of the datasets will have a correct transcription for a given transcribed item (e.g., a name, date, and the like).

In some instances, significant technical hurdles are overcome in performing the document image/data record verification and linking process. Specifically, the described image processing analysis may be difficult to perform on old, low quality, diverse historical documents, such as old genealogical records, which are often handwritten by different individuals. Most modern document processing technologies are conceived and implemented to work with modern office documents and are not well equipped to handle poor image quality or handwritten documents, such as genealogical or other old records. The challenges in dealing with such documents may be overcome by analyzing features of the document images rather than relying on identifying each written word. For instance, the layout of the document may be recognized, such as the number of entries, blank spaces, strikethroughs, ratio of male/female individuals listed on the page, field code arrangements, occupations, various ages or dates listed, handwriting style, and the like. A non-limiting list of the features of a document image that may be analyzed to determine a match or non-match of a selected indexed data record includes:

Handwriting Recognition: According to one embodiment, handwriting recognition may be applied to extract approximate representations of fields on the page. The goal in handwriting recognition may be simply to get some of the characters in some of the fields on the page correct. In some embodiments, the approximate location (e.g., XY coordinates) of the indexed data on a document image may be known or available since the document image, or more likely a similar document image, was previously indexed. Accordingly, handwriting recognition may involve a “fitting” or “registration” problem where the best fit of the known indexed data is computed onto or analyzed with the document image using the approximate location of the data. Techniques such as edit-distance mapping could be employed to “fit” the indexed data onto the “recognized” document image templates. Stated differently, the handwriting recognition could involve word spotting or handwriting characteristic spotting rather than full word or sentence recognition.

Blank Detection: Some fields on the document image page (or cells in the table) may be left blank. Blank detection could be implemented to detect such blank fields on the page. Further, the indexed data could be shifted (e.g., the approximate data locations could be adjusted) when a blank row or cell is detected to achieve a best/matching fit of the indexed data. Stated differently, the XY position coordinates for the indexed data could be adjusted based on the detection of blank fields. In another embodiment, different document images may be matched and linked by recognizing blank fields or spaces that are common and unique to each document.

Strikethrough Detection: In some embodiments, crossed out entries in cells and fields can throw off a sequencing or order when attempting to link indexed data to document images. For example, an indexed data record may show one less data entry than the document image since one of the data entries has a strikethrough. Recognition that an entire row or column has a strikethrough may allow the processing algorithm to adjust the number of data entries recognized and thereby compensate for any data entries that may be missing from the data record. As with blank detection, the purpose of strikethrough detection may be to provide shifting of indexed data fields in order to facilitate recognition or matching of indexed data to the XY image positions of the document image. Having described several embodiments of the invention generally, additional aspects of the invention will be more evident with reference to the figures.
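The three features listed above (edit-distance fitting, blank detection, and strikethrough detection) can be combined in a short Python sketch. It assumes that each detected row on the page has already been reduced to approximate recognized text plus two flags; the dictionary keys `text`, `blank`, and `struck` and the function names are hypothetical, and a simple in-order assignment stands in for whatever fitting procedure an actual embodiment would use.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fit_records(indexed_values, recognized_rows):
    """Fit indexed values onto page rows, skipping blank and struck rows.

    Returns (record_index, row_index, distance) tuples. Skipping the
    unusable rows is what shifts later records down the page.
    """
    usable = [
        row_index
        for row_index, row in enumerate(recognized_rows)
        if not row["blank"] and not row["struck"]
    ]
    fits = []
    for record_index, (value, row_index) in enumerate(zip(indexed_values, usable)):
        text = recognized_rows[row_index]["text"]
        fits.append((record_index, row_index,
                     edit_distance(value.lower(), text.lower())))
    return fits


if __name__ == "__main__":
    rows = [
        {"text": "smth", "blank": False, "struck": False},
        {"text": "", "blank": True, "struck": False},
        {"text": "jonas", "blank": False, "struck": True},
        {"text": "jomes", "blank": False, "struck": False},
        {"text": "brown", "blank": False, "struck": False},
    ]
    for fit in fit_records(["smith", "jones", "brown"], rows):
        print(fit)
```

In this example the blank row and the struck row are passed over, so the second indexed record lands on the fourth page row. Small distances across all fitted rows suggest a good registration, while large distances suggest the indexed data belongs to a different page or is out of sequence.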

Referring now to FIG. 1A, illustrated is a set 100 of similar documents or records that need to be linked or synced together. Specifically, the set 100 includes a set of document images 102 and an indexed dataset 104. The set of document images 102 includes a plurality of individual document images 103 that may be sequentially arranged and/or linked together. For example, the set of document images 102 may include a plurality of census or other records that are arranged and linked according to alphabetical last name listing, date of creation, or according to any other manner. Each document image 103 is a digital representation or digitized form of a corresponding document, such as a census record and the like. The set of document images 102 may include duplicate document images or may be missing one or more documents or entries. In another embodiment, the document images 103 represent consecutive pages of a newspaper article, journal entry, family tree or other history, medical records, legal records, and the like. Further, each document image 103 may be a separate document, file, or record; may be consecutive pages of a single document, record, or file; or any combination thereof.

Set 100 also includes indexed datasets 104. Like the set of documentimages 102, the indexed datasets 104 include a plurality of individualdata records 105 that may be sequentially arranged and/or linked. Eachdata record 105 includes information that was extracted or keyed from adocument image as described herein. The information may be taken fromthe set of document images 102 or from another set of document images,with the latter scenario being more common. For example, as describedherein, companies often make digital copies of the same or similardocument. A specific example involves genealogical related companiesobtaining newly released genealogical information (e.g., censusinformation) and digitizing this information. Hobbyists and othersinvolved in genealogy may make similar digital copies. One or more ofthese groups, or each group, may key or extract information from thedigital copies with the result being several digital copies of the samedocument and several indexed data records obtained from the samedocument. As can be imagined, the quality of the digital copieddocuments may vary widely, as may the content of the keyed or extractedinformation.

In order to obtain and link the highest quality document images (i.e.,digital image) with the most accurate indexed dataset, one or morecompanies or hobbyists may agree to swap document images and indexeddatasets. In some embodiments, one company may have the document imageswhile another company has the indexed datasets for those documentimages. In any event, it is common that the indexed datasets 104 includeextracted or keyed data for a given set of document images 102 eventhough the data was not extracted or keyed from that exact set ofdocument images 102. Further, even though the data may be keyed from theset of document images 102, the individual document images 103 or datarecords 105 may be out of order with respect to one another.Conventional methods of linking the set of document images 102 with theindexed datasets 104 typically involve manual or semi-manual review ofthe document images 103 and data records 105.

In another embodiment, the indexed dataset 104 may include information or data that was keyed or extracted directly from the set of document images 102. In such embodiments, it is often desirable or necessary to verify the information that was keyed or extracted for accuracy and/or to make sure that the order and arrangement of the data records 105 corresponds with the order and arrangement of the document images 103. As with linking documents, conventional methods of verifying the data in this manner typically involve manual review of the data records 105 and document images 103.

Referring now to FIG. 1B, an illustration 110 of a document linking process is provided. Specifically, FIG. 1B shows a document image 112 being analyzed to verify that the document matches a corresponding data record 114. FIG. 1B further shows the document image 112 being linked 116 with the data record 114 if the match between the two documents is confirmed.

As described herein, data record 114 includes information that is keyedor extracted from a digitally copied document (i.e., a source document).The purpose of the analysis of document image 112 is to confirm that thedocument image 112 is a digital copy of the same or a very similarsource document. For example, data record 114 may include informationthat was keyed or extracted from a Mar. 5, 1810 census record thatbegins with an entry for John Smith and ends with an entry for TedSmith. Document image 112 may be a digital copy of the same censusrecord that was made by another company, a hobbyist, or some otherindividual or entity. Document image 112 may be analyzed to determinethat indeed the documents (i.e., document image 112 and the documentfrom which the information for data record 114 was obtained) bothcorrespond to the Mar. 5, 1810 census record. Upon such determination,the data record 114 and the document image 112 may be subsequentlylinked 116.

To analyze the document image 112, various features that are common to both the document image 112 and the data record 114 may be identified and analyzed. The common features may include: a number of data entries, a ratio of various information (e.g., a ratio of male to female, children to parents or adults, and the like), a listing of specific occupations, a list of ages, birth dates, place of birth information, hometown listing, physical features, governmental identifiers, page numbers, number and/or size of data fields, formatting, handwriting style, font style and size, and/or any other unique or semi-unique feature or characteristic. Virtually any unique feature, or any combination of unique or semi-unique features, may be analyzed to make a determination that the documents correspond with the same source document or include the same information.
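As a rough illustration of how a few of these common features might be derived from a data record, the following sketch summarizes a list of keyed entries into an entry count, a male-to-female ratio, and a set of occupations. The dictionary keys `sex` and `occupation`, the `"M"`/`"F"` coding, and the names `RecordFeatures` and `summarize_record` are assumptions made for the example only.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional


@dataclass(frozen=True)
class RecordFeatures:
    entry_count: int
    male_to_female: Optional[float]  # None when no females are listed
    occupations: FrozenSet[str]


def summarize_record(entries: List[Dict[str, str]]) -> RecordFeatures:
    """Reduce keyed entries to a small set of comparable features."""
    males = sum(1 for e in entries if e.get("sex") == "M")
    females = sum(1 for e in entries if e.get("sex") == "F")
    ratio = males / females if females else None
    # Normalize occupations so "Farmer" and "farmer " compare equal.
    occupations = frozenset(
        e["occupation"].strip().lower()
        for e in entries
        if e.get("occupation", "").strip()
    )
    return RecordFeatures(len(entries), ratio, occupations)
```

The same summary could be computed from text recognized in the document image, so that the two summaries can be compared feature by feature.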

Since the data record 114 was keyed from the same source document, the data record 114 may also include other details or information about the layout of the source document, which information may also be used to determine if the document image 112 corresponds with the source document. For example, the data record 114 may include table characteristics of the source document or include information about blank or strikethrough data fields, rows, or columns. The document image 112 may be analyzed to determine if the digitized document includes those blank or strikethrough data fields, rows, columns, and the like. The data record 114 may also include information about handwriting style or characteristics, which may be used in making a match. Essentially, the data record 114 may include fingerprint type information that is unique to a corresponding document image, which can be identified and used to match and link the data record 114 with a corresponding document image 112.

In one embodiment, analyzing the document image 112 to determine a match between the document image 112 and the data record 114 involves identifying and analyzing multiple common features. For example, a first common feature of the data record 114 may be identified and the document image 112 may be analyzed to determine whether the digitized document includes the identified first feature. Similarly, a second, third, fourth, or further feature may be identified from the data record 114 and the document image 112 may be analyzed to determine if those additional features are present. A confidence indicator or value may be assigned based on the presence of the features and may be adjusted with each verification that an identified feature is or is not present. The confidence indicator could be a value that represents the likelihood or probability that the documents relate to the same source document.

For example, a number of data entries may be identified from the data record 114. The document image 112 may then be analyzed (e.g., using OCR techniques) to determine the number of data entries in the document image 112. If the document image contains the same number of data entries, a confidence indicator could be assigned. The analysis could detect and take into account any strikethrough entries, which may not have been recorded in the data record. In this manner, a false negative could be avoided, which otherwise may occur if strikethroughs are not taken into account. A ratio of the number of males to females could then be identified from the data record 114 and the document image 112 could be analyzed (e.g., using OCR techniques) to determine the ratio of males to females in the digitized document. If the number of data entries and the ratio of male to female both match, the confidence indicator or value could be adjusted to show a higher likelihood that the documents correspond to the same source document. In contrast, if the male to female ratio does not match, the confidence indicator could be lowered. Further, listed occupations and/or the number and ratio of children could be identified from the data record 114 and compared with the document image 112 to determine if the document image 112's information matches the data record 114. Still further, stylistic or organizational features of the source document could be identified from the data record 114 and the document image 112 could be analyzed to determine if those stylistic or organizational features are present. Non-limiting examples of such stylistic or organizational features include blank data fields, table information, handwriting style, data field number and size, column and row headings, and the like.
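The feature-by-feature adjustment of the confidence indicator can be sketched as follows. The starting value of 0.5, the fixed step of 0.1 per feature, and the clamping to the range 0 to 1 are illustrative assumptions; the description does not specify how the indicator is computed. The names `update_confidence` and `score_match` are hypothetical.

```python
from typing import Any, Dict


def update_confidence(confidence: float, matched: bool, weight: float = 0.1) -> float:
    """Raise the indicator when a feature is found, lower it otherwise."""
    delta = weight if matched else -weight
    # Keep the indicator inside [0, 1] so it can be read as a likelihood.
    return min(1.0, max(0.0, confidence + delta))


def score_match(
    record_features: Dict[str, Any],
    image_features: Dict[str, Any],
    start: float = 0.5,
) -> float:
    """Compare each feature identified from the data record with the image."""
    confidence = start
    for name, expected in record_features.items():
        matched = image_features.get(name) == expected
        confidence = update_confidence(confidence, matched)
    return confidence
```

For instance, if the entry count and occupations agree but the male-to-female ratio does not, the indicator rises twice and falls once, ending slightly above its starting value.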

Any number of features could be identified from the data record and it should be realized that the information is not limited to personal or genealogical information. For example, various prescriptions, dosage amounts, medications, past illnesses, and the like can be identified from a medical data record and compared with a digitized copy of a medical record to match the records. Similar logic applies to other documents, which may include legal documents, governmental documents, personal documents, ecclesiastical documents, and the like.

If a match between the document image 112 and the data record 114 is made, such as by the confidence indicator being greater than a defined level, the two records may be associated or linked 116. If a match is not made, the document image 112 or data record 114 may be replaced with a different record, respectively, and the process repeated. The process could be repeated until an appropriate match is found.

As described previously, one application of this process is in searchingfor an existing document image 112 from a database when a data record114 is present, or vice versa. For example, if a user has a data record114 and they would like to obtain a digital copy of the document fromwhich that information was keyed or extracted, the user may key invarious features of the document as described above to search for adigital copy of the document. Similarly, if the user has a document, ora digital copy of a document, and would like to obtain a correspondingdata record, the user may scan in the digital copied document,photograph the document, or otherwise digitize the document and performa search for a corresponding data record. The above described method mayalso be used to validate an information keying or extraction process toensure that the information is correct and/or that the sequential orderof the data record and document image pages match.

Referring now to FIG. 1C, another illustration 120 of a document linking process is provided. Specifically, FIG. 1C shows that a plurality of document images 121-123 are being analyzed, matched, and linked with a plurality of data records 124-126. This process may be more useful in linking a set of document images 102 with an indexed dataset 104. Specifically, a first document image 122 may be compared or matched with a first data record 125 as described above. If the documents match, then a link between the documents may be made. Likewise, a preceding document image 121 in a sequentially ordered or arranged image set may be compared with a preceding data record 124 in a sequentially ordered or arranged dataset. If a match is made between the preceding image and data record, 121 and 124, the records may be linked. Further, a subsequent document image 123 in the sequentially ordered or arranged image set may be compared with a subsequent data record 126 in the sequentially ordered or arranged dataset. If a match is made between the subsequent image and data record, 123 and 126, the records may be linked. Any desired number of preceding and subsequent images and data records may be analyzed and verified.

In some embodiments, analyzing the preceding and subsequent images and data records may facilitate determining that a given image and data record, 122 and 125, are indeed related and correspond to the same source document. In another embodiment, verification that the given, preceding, and subsequent images and data records match may confirm that the set of document images and the indexed dataset are ordered or arranged according to a known sequence, and the remaining unanalyzed images and data records may be subsequently linked. To further verify that the images and data records are not out of order, every nth preceding or subsequent image and data record may be compared and verified as described herein. If it is determined that a data record or document image is missing, the sequential arrangement of the dataset or set of document images may be adjusted accordingly.
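Checking every nth image and data record can be sketched as a simple spot check over two equally ordered sequences. The `matches` callback stands in for whatever comparison is used (for example, a confidence indicator exceeding a threshold); it and the function name `spot_check` are hypothetical.

```python
from typing import Callable, List, Sequence, TypeVar

I = TypeVar("I")
D = TypeVar("D")


def spot_check(
    images: Sequence[I],
    records: Sequence[D],
    matches: Callable[[I, D], bool],
    n: int,
) -> List[int]:
    """Return the sampled positions at which image and record disagree.

    Only positions 0, n, 2n, ... are compared. An empty result suggests the
    two sets share the same order; a non-empty result marks where the
    sequence may have slipped (e.g., a missing or duplicate item).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    limit = min(len(images), len(records))
    return [i for i in range(0, limit, n) if not matches(images[i], records[i])]
```

A small n catches misalignment sooner at the cost of more comparisons; a larger n is cheaper but may leave a longer run of items to re-examine once a mismatch is found.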

In yet another embodiment, the described process of verifying preceding and subsequent images and data records may be performed for each image and data record in the respective sets to link the set of document images and indexed datasets together. This process may ensure that linking errors are not made when a dataset and/or set of images contains duplicate or missing documents or records.

Referring now to FIG. 1D, another illustration 130 of a linking processis provided. In this illustration 130, a first document image 132 mayalready be linked with a corresponding data record 136 (shown by thelarge grey arrow). Similarly, a second document image 134 may also belinked with a corresponding data record 138 (shown by the large greyarrow). This situation is common when the data record (e.g., 136)includes information keyed or extracted from a document image (e.g.,132) in a common or single library, such as when a given entity owns thedocument image and extracts the information to produce a data record andthe two records are maintained in the same library. In such instances,it may be desirable to link data records and/or document images in onelibrary with similar data records and/or document images from anotherlibrary. FIG. 1D illustrates such a process.

These cross library linkages may be made in several ways. For example, the first document image 132 may be compared with the second data record 138 to determine if the records match. If the records do match, a link may be made between the records as shown by the diagonal arrow connecting the records. Based on the match between the first document image 132 and the second data record 138, it may be determined that the first and second document images, 132 and 134, correspond with the same source document and these two document images may be linked as shown by the vertical arrow. It may also be determined that the first and second data records, 136 and 138, similarly match based on the previous matches and these two records may likewise be linked as shown by the vertical arrow. The second document image 134 may similarly be compared and linked with the first data record 136 and the other records may similarly be matched and linked.
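The way one confirmed cross library match implies the remaining links can be modeled with a disjoint-set (union-find) structure. This is one possible bookkeeping approach chosen for illustration, not a structure named in the description; the class `LinkGroups` and the string identifiers are hypothetical.

```python
from typing import Dict, Hashable


class LinkGroups:
    """Track which images and records are known to share a source document."""

    def __init__(self) -> None:
        self._parent: Dict[Hashable, Hashable] = {}

    def _find(self, item: Hashable) -> Hashable:
        self._parent.setdefault(item, item)
        root = item
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited node directly at the root.
        while self._parent[item] != root:
            next_item = self._parent[item]
            self._parent[item] = root
            item = next_item
        return root

    def link(self, a: Hashable, b: Hashable) -> None:
        """Record that a and b correspond to the same source document."""
        self._parent[self._find(a)] = self._find(b)

    def linked(self, a: Hashable, b: Hashable) -> bool:
        return self._find(a) == self._find(b)
```

Starting from the two within-library links (132 with 136, and 134 with 138), adding the single diagonal link between image 132 and record 138 is enough for the structure to report that the two images, and the two records, are linked as well.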

In another embodiment, the first and second document images, 132 and 134, may be analyzed and linked using the processes described herein. For example, the first document image 132 may be analyzed to identify one or more features unique to the first document image, such as handwriting style, formatting, layout, data entries, identified words or text, strikethroughs, blank spaces, data field arrangement, data field size, data field layout, and the like. The second document image 134 may then be analyzed to determine if the digitized document includes these identified features. If a match is made, the first and second images, 132 and 134, can be linked as shown by the vertical arrow and the corresponding data records may likewise be linked as described herein. Since the first and second document images essentially correspond to the same source document, these documents essentially have a unique document fingerprint that can be identified and verified to match the two documents.

Similarly, features of the first and second data records, 136 and 138, can be identified and the records can be compared and matched as described herein and shown by the vertical arrow. The first and second data records, 136 and 138, will likewise have a unique data fingerprint that can be determined and used in matching and linking these records. The corresponding document images may then likewise be linked as described herein.

Referring now to FIG. 2A, illustrated is one embodiment of a document image 210 that is a digital representation or digitized form of a document or record. In this embodiment, the document or record is a historical census document. As shown in the figure, the census was filled out by hand and contains various information about individuals listed on the census record including individuals' names, ages, relationships, and the like. One of the entries has a strikethrough and several of the fields are left blank or have strikethroughs. Some of the letters are relatively illegible or difficult to discern. It is easily understandable that current optical character recognition (OCR) or other imaging software may have a difficult time accurately recognizing each character of the handwritten text. Further, other similar census documents may have been filled out by another individual, making OCR even more difficult. As can also be easily understood, the document image 210 has unique or characteristic features, or a document fingerprint, such as the layout, number of data entries, information in the data fields, strikethroughs or blank sections, name information, handwriting style, and the like. These unique features enable the document to be identified as described herein and a match and/or linkage to be made with other document images, data records, and the like. The data record may contain information about one or more of these unique features of document image 210 to enable the matching and linking as described herein. Other documents likewise have unique features or a document fingerprint that enables identification, matching, linking, and the like.

With reference to FIG. 2B, illustrated is a simple representation of another document image 220 that is a digital representation or digitized form of a document. In this embodiment, document image 220 is a simple representation of a birth certificate, although in other embodiments, document image 220 may be any type of document, such as a book, a death certificate, an article, a legal or medical document, a newspaper clipping, a public record, an employee record, sales record, and the like. Document image 220 includes information about an individual, place, or thing, which in this embodiment is information about the birth of an individual. Document image 220 includes a title 202 that identifies the type of document, a last name field 204 that identifies the last name of the identified individual, and one or more date fields 206 that show when the document was created and/or when an event occurred (e.g., the date of birth). Document image 220 may also include authentication indicia 208, such as a seal or stamp that verifies the authenticity of the document. The authenticity of the document may be important if questions related to the validity of the information within the document arise, such as when the information within separate documents differs. For example, in a genealogical record, the date of birth of an individual in a birth certificate may differ from the date of birth of the same individual in a family history book. In such instances, the authentication indicia 208 may be relied upon to verify the date of birth information. Like document image 210, document image 220 also has unique features that enable identification, matching, linking, and the like.

Referring now to FIG. 3, illustrated is a diagram 300 of a process for linking a set of document images with an indexed dataset. The process 300 begins at step 302 where document images are digitized, such as by scanning or photographing the documents, or a set of digitized images is provided. The document image set is shown by the designation Image-set I. Similarly, at step 306 an indexed dataset is provided or formed by keying or extracting information from a digitized image, which is typically, but not necessarily, a different digitized image from the ones being analyzed. The indexed dataset is shown by the designation Data-set D. At step 310 a target document image is loaded. The loaded document image is designated I_{pos_i}. At step 314, a target data record is loaded. The loaded data record is designated D_{pos_d}.

At step 322, image processing analysis may be performed as described herein to determine the unique or semi-unique features of the loaded document image. In some embodiments, the image processing analysis may include: page layout analysis, table detection, field detection, blank cell detection, strikethrough detection, handwriting recognition, word spotting, and the like, as described previously. At step 326, a best match analysis may be performed to determine if the loaded document image matches or corresponds with the loaded data record. The computed match between the document image and the data record is designated as m. As described herein, the data or information of the loaded data record may be used to determine a match between the two records. A confidence indicator could be calculated based on the match between the document image and the data record. The computed match and resulting confidence indicator may be based on one, two, or more compared features of the document image and the data record.

If the match m does not exceed a defined confidence threshold ct, or more precisely, if the computed confidence indicator does not exceed the defined confidence threshold ct, then the documents may not be matched. In some embodiments, when the match or confidence indicator does not exceed the confidence threshold ct, or is relatively close to the threshold, a manual intervention step 318 may be required or performed. This may allow a user to make a final decision as to the relation between the loaded document image and the data record to verify if the records in fact match or not. Thus, the final decision may remain under the user's control, which may be important for old documents, for which image processing may be relatively difficult.

If the match decision from the manual intervention step 318 is ultimately negative, or if step 318 is not performed, another data record (e.g., a prior or subsequent data record designated as D_{pos_d-n} or D_{pos_d+n}) may be loaded at step 314 and the best match analysis may be performed at step 326. This process may be repeated until a matching data record is found or the data records of the indexed dataset (Data-set D) are exhausted. Alternatively, another document image (e.g., a prior or subsequent document image designated as I_{pos_i-n} or I_{pos_i+n}) may be loaded at step 310 and the image processing analysis at step 322 and the best match analysis at step 326 may be performed. This process may likewise be repeated until a matching document image is found or the document images of the set of document images (Image-set I) are exhausted.

If the match m or confidence indicator is above the confidence threshold ct in step 330, the process continues to step 334 where the linking process is performed to link or associate the matching document image and data record. At step 338, a determination is made as to whether the position of the document image or data record is greater than a max position. In other words, a determination is made as to whether the current document is the last record in the indexed dataset (Data-set D) or the last document image in the set of document images (Image-set I). If the document is not the last image or record in either set, the process is repeated and a subsequent target document image is loaded at step 310 and a subsequent target data record is loaded at step 314. If the current document is the last record in the indexed dataset (Data-set D) or the last document image in the set of document images (Image-set I), the process ends at step 342.
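The overall loop of FIG. 3 can be approximated by the following sketch, which walks Image-set I in order and, for each image, tries the expected data record first and then records n positions before and after it. The `match_score` callback stands in for the image processing and best match analyses of steps 322 and 326; the `window` parameter, the function name `link_sets`, and the omission of the manual intervention step 318 are simplifying assumptions.

```python
from typing import Callable, Dict, Sequence, Set, TypeVar

I = TypeVar("I")
D = TypeVar("D")


def link_sets(
    images: Sequence[I],
    records: Sequence[D],
    match_score: Callable[[I, D], float],
    ct: float,
    window: int = 1,
) -> Dict[int, int]:
    """Return a mapping from image position to linked record position."""
    links: Dict[int, int] = {}
    used: Set[int] = set()
    pos_d = 0  # position of the record expected to match the next image
    for pos_i, image in enumerate(images):
        # Expected record first, then D_{pos_d-n} and D_{pos_d+n} for growing n.
        candidates = [pos_d]
        for n in range(1, window + 1):
            candidates += [pos_d - n, pos_d + n]
        for cand in candidates:
            if not 0 <= cand < len(records) or cand in used:
                continue
            if match_score(image, records[cand]) > ct:  # threshold test (step 330)
                links[pos_i] = cand  # link image and record (step 334)
                used.add(cand)
                pos_d = cand + 1
                break
    return links
```

Images for which no candidate exceeds the threshold are simply left unlinked in this sketch; in practice they might be routed to a manual review queue.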

With reference to FIG. 4, illustrated is a schematic of one embodimentof a computer system 400 that can perform the methods of the invention,as described herein. For example, the computer system can function as asystem that is able to perform the image processing analysis, best matchanalysis, linking process, and the like. It should be noted that FIG. 4is meant only to provide a generalized illustration of variouscomponents, any or all of which may be utilized as appropriate. FIG. 4,therefore, broadly illustrates how individual system elements may beimplemented in a relatively separated or relatively more integratedmanner.

The computer system 400 is shown comprising hardware elements that canbe electrically coupled via a bus 405 (or may otherwise be incommunication, as appropriate). The hardware elements can include one ormore processors 410, including, without limitation, one or moregeneral-purpose processors and/or one or more special-purpose processors(such as digital signal processing chips, graphics acceleration chips,and/or the like); one or more input devices 415, which can include,without limitation, a mouse, a keyboard and/or the like; and one or moreoutput devices 420, which can include, without limitation, a displaydevice, a printer and/or the like.

The computer system 400 may further include (and/or be in communicationwith) one or more storage devices 425, which can comprise, withoutlimitation, local and/or network accessible storage and/or can include,without limitation, a disk drive, a drive array, an optical storagedevice, a solid-state storage device, such as a random access memory(“RAM”) and/or a read-only memory (“ROM”), which can be programmable,flash-updateable and/or the like. The computer system 400 might alsoinclude a communications subsystem 430, which can include withoutlimitation a modem, a network card (wireless or wired), an infra-redcommunication device, a wireless communication device and/or chipset(such as a Bluetooth® device, an 802.11 device, a WiFi device, a WiMaxdevice, cellular communication facilities, etc.), and/or the like. Thecommunications subsystem 430 may permit data to be exchanged with anetwork, and/or any other devices described herein. In many embodiments,the computer system 400 will further comprise a working memory 435,which can include a RAM or ROM device, as described above.

The computer system 400 can also comprise software elements, shown as being currently located within the working memory 435, including an operating system 440 and/or other code, such as one or more application programs 445, which may comprise computer programs of the invention, and/or may be designed to implement methods of the invention and/or configure systems of the invention, as described herein. Merely by way of example, one or more procedures described with respect to the method(s) discussed above might be implemented as code and/or instructions executable by a computer (and/or a processor within a computer). A set of these instructions and/or code might be stored on a computer readable storage medium, such as the storage device(s) 425 described above. In some cases, the storage medium might be incorporated within a computer system, such as the system 400. In other embodiments, the storage medium might be separate from a computer system (e.g., a removable medium, such as a compact disc, etc.), and/or provided in an installation package, such that the storage medium can be used to program a general purpose computer with the instructions/code stored thereon. These instructions might take the form of executable code, which is executable by the computer system 400 and/or might take the form of source and/or installable code, which, upon compilation and/or installation on the computer system 400 (e.g., using any of a variety of generally available compilers, installation programs, compression/decompression utilities, etc.) then takes the form of executable code.

It will be apparent to those skilled in the art that substantialvariations may be made in accordance with specific requirements. Forexample, customized hardware might also be used, and/or particularelements might be implemented in hardware, software (including portablesoftware, such as applets, etc.), or both. Further, connection withother computing devices such as network input/output devices may beemployed.

In one aspect, the invention employs a computer system (such as thecomputer system 400) to perform methods of the invention. According to aset of embodiments, some or all of the procedures of such methods areperformed by the computer system 400 in response to processor 410executing one or more sequences of one or more instructions (which mightbe incorporated into the operating system 440 and/or other code, such asan application program 445) contained in the working memory 435. Suchinstructions may be read into the working memory 435 from anothermachine-readable medium, such as one or more of the storage device(s)425. Merely by way of example, execution of the sequences ofinstructions contained in the working memory 435 might cause theprocessor(s) 410 to perform one or more procedures of the methodsdescribed herein.

The terms “machine-readable medium,” “computer-readable medium,” and“computer-readable storage medium,” as used herein, refer to any mediumthat participates in providing data that causes a machine to operate ina specific fashion. In an embodiment implemented using the computersystem 400, various machine-readable media might be involved inproviding instructions/code to processor(s) 410 for execution and/ormight be used to store and/or carry such instructions/code (e.g., assignals). In many implementations, a computer readable medium is aphysical and/or tangible storage medium. Such a medium may take manyforms, including but not limited to, non-volatile media, volatile media,and transmission media. Non-volatile media includes, for example,optical or magnetic disks, such as the storage device(s) 425. Volatilemedia includes, without limitation, dynamic memory, such as the workingmemory 435. Transmission media includes coaxial cables, copper wire, andfiber optics, including the wires that comprise the bus 405, as well asthe various components of the communication subsystem 430 (and/or themedia by which the communications subsystem 430 provides communicationwith other devices). Hence, transmission media can also take the form ofwaves (including without limitation radio, acoustic and/or light waves,such as those generated during radio-wave and infra-red datacommunications).

Common forms of physical and/or tangible computer readable mediainclude, for example, a floppy disk, a flexible disk, hard disk,magnetic tape, or any other magnetic medium, a CD-ROM, any other opticalmedium, punchcards, papertape, any other physical medium with patternsof holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chipor cartridge, a carrier wave as described hereinafter, or any othermedium from which a computer can read instructions and/or code.

Various forms of machine-readable media may be involved in carrying oneor more sequences of one or more instructions to the processor(s) 410for execution. Merely by way of example, the instructions may initiallybe carried on a magnetic disk and/or optical disc of a remote computer.A remote computer might load the instructions into its dynamic memoryand send the instructions as signals over a transmission medium to bereceived and/or executed by the computer system 400. These signals,which might be in the form of electromagnetic signals, acoustic signals,optical signals and/or the like, are all examples of carrier waves onwhich instructions can be encoded, in accordance with variousembodiments of the invention.

The communications subsystem 430 (and/or components thereof) generally will receive the signals, and the bus 405 then might carry the signals (and/or the data, instructions, etc., carried by the signals) to the working memory 435, from which the processor(s) 410 retrieves and executes the instructions. The instructions received by the working memory 435 may optionally be stored on a storage device 425 either before or after execution by the processor(s) 410.

While the invention has been described with respect to exemplary embodiments, one skilled in the art will recognize that numerous modifications are possible. For example, the methods and processes described herein may be implemented using hardware components, software components, and/or any combination thereof. Further, while various methods and processes described herein may be described with respect to particular structural and/or functional components for ease of description, methods of the invention are not limited to any particular structural and/or functional architecture but instead can be implemented on any suitable hardware, firmware and/or software configuration. Similarly, while various functionality is ascribed to certain system components, unless the context dictates otherwise, this functionality can be distributed among various other system components in accordance with different embodiments of the invention.

Moreover, while the procedures comprised in the methods and processes described herein are described in a particular order for ease of description, unless the context dictates otherwise, various procedures may be reordered, added, and/or omitted in accordance with various embodiments of the invention. Moreover, the procedures described with respect to one method or process may be incorporated within other described methods or processes; likewise, system components described according to a particular structural architecture and/or with respect to one system may be organized in alternative structural architectures and/or incorporated within other described systems. Hence, while various embodiments are described with, or without, certain features for ease of description and to illustrate exemplary features, the various components and/or features described herein with respect to a particular embodiment can be substituted, added and/or subtracted from among other described embodiments, unless the context dictates otherwise. Consequently, although the invention has been described with respect to exemplary embodiments, it will be appreciated that the invention is intended to cover all modifications and equivalents within the scope of the following claims.
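
Merely by way of example, and without limiting any embodiment, the feature-based linking of indexed data to document images might be sketched in software as follows. The sketch assumes that image analysis (e.g., entry counting or word spotting) has already reduced each document image to a small dictionary of recognized attributes. The names Feature, adjust_confidence, score, and link, the feature weights, the noisy-OR update rule, and the 0.7 threshold are all hypothetical choices made for illustration and are not taken from this disclosure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# All names below are hypothetical; the image analysis itself (OCR, table
# detection, word spotting) is assumed to have already produced `image_info`.


@dataclass
class Feature:
    name: str
    weight: float  # assumed evidence strength in (0, 1), not from the source
    is_present: Callable[[dict, dict], bool]  # (record, image_info) -> bool


def adjust_confidence(confidence: float, weight: float) -> float:
    """Noisy-OR update (an assumed rule): a present feature raises confidence."""
    return 1.0 - (1.0 - confidence) * (1.0 - weight)


def score(record: dict, image_info: dict, features: List[Feature]) -> float:
    """Accumulate a confidence indicator over the features found in the image."""
    confidence = 0.0
    for feature in features:
        if feature.is_present(record, image_info):
            confidence = adjust_confidence(confidence, feature.weight)
    return confidence


def link(record: dict, images: Dict[str, dict], features: List[Feature],
         threshold: float = 0.7) -> List[str]:
    """Return ids of images whose confidence meets the (assumed) threshold."""
    return [image_id for image_id, info in images.items()
            if score(record, info, features) >= threshold]


# Two illustrative features: number of record entries, and word spotting.
features = [
    Feature("entry_count", 0.5,
            lambda rec, img: rec["entry_count"] == img["entry_count"]),
    Feature("word_spotting", 0.6,
            lambda rec, img: set(rec["surnames"]) <= img["words"]),
]

record = {"entry_count": 3, "surnames": ["Smith", "Jones"]}
images = {
    "img_001": {"entry_count": 3, "words": {"Smith", "Jones", "Brown"}},
    "img_002": {"entry_count": 5, "words": {"Taylor"}},
}

linked = link(record, images, features)
print(linked)
```

In this sketch, both features are present in img_001, so its confidence indicator is first set by the entry-count feature and then increased by the word-spotting feature, which carries it past the assumed threshold. Neither feature is present in img_002, so that image is not linked. Other update rules and thresholds could equally be used.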

What is claimed is:
 1. A method of linking a set of document images with an indexed dataset, in order to link document images as corresponding to the same source document, the method comprising: providing a set of document images, the document images being digitized documents having information about individuals, places, or things; providing an indexed dataset, the indexed dataset including data records that each include information extracted or keyed from a document image of the set of document images or a similar set of document images; selecting a data record of the indexed dataset; identifying one or more features of the data record, the one or more features being defined by a set of the information extracted or keyed from one of the document images; selecting a first document image of the set of document images; analyzing the first document image to determine that the one or more features of the data record are present within the information of the first document image; determining that the data record corresponds with the first document image based on the presence of the one or more features within the information of the first document image; selecting a second document image of the set of document images; analyzing the second document image to determine that the one or more features of the data record are present within the information of the second document image; and linking at least a portion of the indexed dataset with both first and second documents in the set of document images, reflecting that the first and second documents correspond to the same source document.
 2. The method of claim 1, further comprising: selecting a second data record of the indexed dataset; identifying a second one or more features of the second data record, the second one or more features being defined by a set of the information extracted or keyed from a different one of the document images; determining that the second data record corresponds with the second document image based on the presence of the second one or more features within the information of the second document image; determining a confidence indicator based on the presence of a first one or more features within the information of the first document image, the confidence indicator indicating a probability that the first data record of indexed dataset corresponds with the first document image of the set of document images; and adjusting the confidence indicator based on the presence of the second one or more features within the information of the second document image, the adjusted confidence indicator indicating an increased probability that the indexed dataset corresponds with the second document image of the set of document images.
 3. The method of claim 2, wherein the document images of the set of document images are arranged in sequential order, and wherein the data records of the indexed dataset are arranged in sequential order, and wherein the method further comprises: determining a sequential arrangement of the first and second document images; determining a sequential arrangement of the first and second data records; determining a sequential arrangement for each of the remaining document images and data records based on the respective determined sequential arrangements of the first and second document images and the first and second data records; and linking the data records with the document images based on the determined sequential arrangements of the data records and document images.
 4. The method of claim 3, further comprising: determining that the indexed dataset is missing a data record or includes a duplicate data record; or determining that the set of document images is missing a document image or includes a duplicate document image; and adjusting the sequential order of the corresponding indexed dataset or set of document images based on the determination.
 5. The method of claim 1, further comprising: linking the set of document images with a second set of document images based on the step of linking the data records with the document images, the second set of document images being digitized documents from which the information for the indexed dataset was extracted or keyed; or linking the indexed dataset with a second indexed dataset based on the step of linking the data records with the document images, the second indexed dataset including information extracted or keyed from the documents of the set of document images.
 6. The method of claim 1, wherein the set of document images comprises genealogical documents and the indexed dataset comprises genealogical data.
 7. The method of claim 1, wherein the set of document images is provided from a first source, and wherein the indexed dataset is provided from a second source different than the first source.
 8. A method of linking a document image with indexed data comprising: providing a document image, the document image being a digitized document having information about an individual, place, or thing; providing indexed data, the indexed data including information extracted or keyed from a different document image; identifying a first feature of the indexed data, the first feature being defined by a set of the information extracted or keyed from the different document image; analyzing the document image to determine that the first feature is present within the information of the document image; determining that the indexed data corresponds with the document image based on the presence of the first feature within the information of the document image; identifying a second feature of the indexed data, the second feature being defined by a subset of the information extracted or keyed from the different document image; analyzing the document image to determine that the second feature is present within the information of the document image; determining that the indexed data corresponds with the document image based on the presence of the first feature and the second feature within the information of the document image; determining a confidence indicator based on the presence of the first feature within the information of the document image, the confidence indicator indicating a probability that the indexed data corresponds with the document image; adjusting the confidence indicator based on the presence of the second feature within the information of the document image, the adjusted confidence indicator indicating an increased probability that the indexed data corresponds with the document image; and linking the indexed data with the document image.
 9. The method of claim 8, wherein the first feature or the second feature comprises one or more selected from the group consisting of: recognition of a number of record entries of the document image; recognition of a sex of an individual or individuals identified on the document image; handwriting recognition; strikethrough recognition; blank field recognition; table detection; field detection; and word spotting.
 10. The method of claim 9, wherein a number of detected recorded entries for the document image is adjusted based on a detection or recognition of a strikethrough for an entry.
 11. The method of claim 9, wherein an approximate location on the document image is known for the information within the indexed data, and wherein the method further comprises adjusting the approximate location for the information based on a detection of a blank field, row, or column of information on the document image.
 12. The method of claim 8, further comprising: determining a third feature of the indexed data, the third feature being defined by an additional subset of the information extracted or keyed from the document image; analyzing the document image to determine that the third feature is present within the information of the document image; and readjusting the confidence indicator based on the presence of the third feature within the information of the document image.
 13. The method of claim 8, wherein the document image is received from a first source, and wherein the indexed data is received from a second source different than the first source.
 14. The method of claim 8, wherein the document image is a digitized genealogical document, and wherein the indexed data is genealogical data.
 15. A non-transitory computer readable medium having instructions encoded thereon, which when executed by a processor, cause the processor to perform one or more of the following operations, in order to link document images as originating with the same source document: provide a set of document images, the document images being digitized documents having information about individuals, places, or things; provide an indexed dataset, the indexed dataset including data records that each include information extracted or keyed from a document image of the set of document images or a similar set of document images; select a data record of the indexed dataset; identify one or more features of the first data record, the one or more features being defined by a set of the information extracted or keyed from one of the document images; select a first document image of the set of document images; analyze the first document image to determine that the one or more features are present within the information of the first document image; determine that the data record corresponds with the first document image based on the presence of the one or more features within the information of the first document image; select a second document image of the set of document images; analyze the second document image to determine that the one or more features are present within the information of the second document image; and link at least a portion of the indexed dataset with the set of document images, reflecting that the first and second documents correspond to the same source document.
 16. The non-transitory computer readable medium of claim 15, wherein the document images of the set of document images are arranged in sequential order, and wherein the data records of the indexed dataset are arranged in sequential order.
 17. The non-transitory computer readable medium of claim 16, wherein the operation further includes: determining a sequential arrangement of the first and second document images; determining a sequential arrangement of a first data record and a second data record; determining a sequential arrangement for each of the remaining document images and data records based on the respective determined sequential arrangements of the first and second document images and the first and second data records; and linking the data records with the document images based on the determined sequential arrangements of the data records and document images.
 18. The non-transitory computer readable medium of claim 17, wherein the operation further includes: determining that the indexed dataset is missing a data record or includes a duplicate data record; or determining that the set of document images is missing a document image or includes a duplicate document image; and adjusting the sequential order of the corresponding indexed dataset or set of document images based on the determination.